;;; -*- Mode: TEXT -*-
;;; File: AutoClass:doc;interpretation.text
;;;------------------------------------------------------------------------;;;
;;; AUTOCLASS 3.0 Released 5/11/90 contact: Taylor@pluto.arc.nasa.gov ;;;
;;; by P. Cheeseman, J. Stutz, R. Hanson, W. Taylor ;;;
;;; NASA Ames Research Center, MS 244-17, Moffett Field, CA 94035 ;;;
;;; ;;;
;;; Copyright (C) 1990 Research Institute for Advanced Computer Science. ;;;
;;; All rights reserved. The RIACS Software Policy contains specific ;;;
;;; terms and conditions on the use of this software, and must be ;;;
;;; distributed with any copies. THIS FILE MAY BE REDISTRIBUTED. This ;;;
;;; copyright and notice must be preserved in all copies made of this file.;;;
;;;------------------------------------------------------------------------;;;
Interpretation of AutoClass Results:
Now you have run AutoClass on your data set – what have you got? Typically,
the AutoClass search procedure finds many classifications, but only saves the
few best. These are now available for inspection and interpretation. The most
important indicator of the relative merits of these alternative classifications
is the Log total posterior probability value. Note that since the probability
lies between 0 and 1, the Log probability is negative and ranges from negative
infinity to 0. The number e raised to the power of the difference between two
Log probability values gives the relative probability of the alternative
classifications. So a difference of, say, 100 implies one classification is
e^100 times more likely than the other. However, these numbers can be very
misleading, since they give the relative probability of alternative
classifications under the AutoClass ***assumptions***.
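As a small illustration of the arithmetic above, the following Python sketch
converts a difference in Log probability into a relative probability. The
function name and the sample values are hypothetical and are not AutoClass
output; the sketch assumes the Log values are natural logarithms.

```python
import math

def relative_probability(log_prob_a, log_prob_b):
    """Return how many times more probable classification A is than B.

    Both arguments are natural-log posterior probabilities (negative numbers).
    """
    return math.exp(log_prob_a - log_prob_b)

# Hypothetical Log probability values for two saved classifications.
best = -12050.0
runner_up = -12053.0
ratio = relative_probability(best, runner_up)
print(f"best is about {ratio:.1f} times more probable than runner-up")
```

For differences of several hundred or more, math.exp overflows, so in that
range it is simpler to compare the Log probability difference directly.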
Specifically, the most important AutoClass assumptions are the use of normal
models for real variables, and the assumption of independence of attributes
within a class. Since these assumptions are often violated in practice, the
difference in posterior probability of alternative classifications can be
partly due to one classification being closer to satisfying the assumptions
than another, rather than to a real difference in classification quality.
Another source of uncertainty about the utility of Log probability values is
that they do not take into account any specific prior knowledge the user may
have about the domain. This means that it is often worth looking at
alternative classifications to see if you can interpret them, but it is worth
starting from the most probable first. Note that if the Log probability value
is much greater than that for the one class case, it is saying that there is
overwhelming evidence for ***some*** structure in the data, and part of this
structure has been captured by the AutoClass classification.
So you have now picked a classification you want to examine, based on its Log
probability value; how do you examine it? The first thing to do is to
generate an "influence" report on the classification using the report
generation facilities documented in /doc/reports.text. An influence report
is designed to summarize the important information buried in the AutoClass
data structures. The first part of this report is a listing of the overall
"influence" of each of the attributes used in the classification. These
influence values are a weighted average of the "influence" of each attribute
in each class, as described below. The next part of the report is a summary
description of each of the classes. The classes are arbitrarily numbered from
0 up to n, in order of descending class weights. A class weight of say 34.1
means that the weighted sum of membership probabilities for that class is 34.1.
Note that a class weight of 34 does not mean that 34 cases belong to that
class, since many cases may have only partial membership in that class.
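The relation between class weights and partial membership can be sketched in
Python as follows. The membership table and the function name are
hypothetical, and the sketch assumes every case counts equally, so the
weighted sum reduces to a plain sum over cases.

```python
# Hypothetical membership probabilities: one row per case, one column per class.
# Each row sums to 1 because every case is fully distributed over the classes.
memberships = [
    [0.9, 0.1],
    [0.6, 0.4],
    [0.5, 0.5],
    [0.2, 0.8],
]

def class_weights(rows):
    """Sum each column of membership probabilities to get per-class weights."""
    return [sum(column) for column in zip(*rows)]

weights = class_weights(memberships)
print(weights)
```

Here class 0 gets a weight of roughly 2.2 and class 1 roughly 1.8, although
no case belongs wholly to either class.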
In the report on each class, the attribute parameters for that class are
given in order of decreasing influence value. The "influence" value of an
attribute in a class is an information measure that roughly indicates how
informative that attribute is in describing a class – i.e., an indication of
the distinguishing attributes for that class. Only the first few attributes
usually have significant influence value. If an influence value drops below
about 20